Coding into a source: a direct inverse Rate-Distortion theorem



I. Introduction 

In [1], Shannon proved that if there is a channel with capacity $C > R(D)$, a source can be transmitted
to within a distortion $D$ reliably over this channel ($R(\cdot)$ is the rate-distortion function for the source) in
two steps:

1) Suppose $C = R(D - \alpha)$. First, source code to within a distortion $(D - \alpha/2)$ by using random codes.
The source code has rate arbitrarily close to $R(D - \alpha/2)$.

2) Transmit these bits reliably¹ over the channel.

The traditional converse to this separation theorem is proved using the data-processing inequality and 
shows that no other joint source-channel scheme can do any better. 

We want to instead ask the converse question at the engineering level²: if there is a "black box" over
which an iid source $X_i \sim p_X$ can be transmitted to within a distortion level $D$, can we do reliable
communication of bits (in the Shannon sense) over this "black box" at rates less than $R(D)$?

If one assumes that the communication of $X_i$ over the black box satisfies only an expected distortion
constraint $E\,d(X_i, \hat{X}_i) \le D$, then we cannot guarantee reliable communication. The black box should be
viewed as an attacker and the attacker can do anything that it wishes as long as it meets the expected 
distortion constraint. 

Consider an equiprobable binary source on $\{0, 1\}$ under the Hamming distortion. Suppose the black box
is constrained to communicate this source to within an expected distortion of 0.25. A possible attacker
could flip a fair coin once at the beginning of time. If it is heads, then it transmits the symbols perfectly
for all time; if it is tails, it just transmits 0 for all time. It is then easy to see that one cannot do reliable
communication over this attacker at any non-zero rate, whereas the rate-distortion function $R(0.25) > 0$.
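The coin-flip attacker can be simulated in a few lines. The sketch below is only an illustration: it assumes the "tails" branch emits the constant symbol 0, and the function names, block length and trial count are hypothetical choices. It estimates both the expected distortion and the fraction of blocks whose average distortion exceeds 0.25.

```python
import random

def coin_flip_attacker(x, rng):
    """Hypothetical attacker: a single fair coin decides the whole block."""
    if rng.random() < 0.5:
        return list(x)       # heads: pass every symbol through unchanged
    return [0] * len(x)      # tails: ignore the input and emit a constant

def hamming_rate(x, y):
    """Average Hamming distortion between two equal-length sequences."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def simulate(n=1000, trials=400, seed=0):
    rng = random.Random(seed)
    rates = []
    for _ in range(trials):
        x = [rng.randint(0, 1) for _ in range(n)]
        rates.append(hamming_rate(x, coin_flip_attacker(x, rng)))
    mean_distortion = sum(rates) / trials
    frac_bad_blocks = sum(r > 0.25 for r in rates) / trials
    return mean_distortion, frac_bad_blocks

mean_distortion, frac_bad_blocks = simulate()
print(mean_distortion, frac_bad_blocks)
```

The mean distortion comes out close to 0.25, yet about half of the blocks carry no information about the input at all, which is why an expected-distortion constraint alone cannot support reliable communication.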

Thus, the expected distortion constraint is not sufficient. It turns out that a block distortion constraint
is sufficient. If the attacker is such that³

$$\Pr\left( \frac{1}{n} \sum_{t=1}^{n} d(X_t, Y_t) > D \right) \to 0 \text{ as } n \to \infty \tag{1}$$

it can be proved that reliable communication is possible over this attacker at all rates less than $R(D)$.
This is the main theorem of this paper, which is stated formally in Section III.

Following [2], one can draw an equivalence between all rate-distortion problems with a given value of
$R(D)$. Consider the collection of all iid sources and corresponding distortion levels, $(X_\beta, D_\beta)$, such that
$R_{X_\beta}(D_\beta) = R_0$. If any one of these sources can be communicated over an attacker such that the block
distortion criterion (1) holds, then all of them can be communicated to within a distortion level $(D_\beta + \delta)$
over this same attacker, for arbitrarily small positive $\delta$. One way to show this is:

1) Source code one source to within the distortion level $D_\beta + \delta$ by using less than $nR_0$ bits.

2) Communicate these $nR_0$ bits reliably by embedding them into the source accepted by the attacker
and recovering them from the distorted sequence.

In Section II we state the precise formulation of the above problem. In Section III we state our main
theorem. In Section IV we state the connection of the formulated problem to coding theory, arbitrarily
varying channels and to watermarking with no covertext. In Section V we prove the theorems stated
in Section III and comment on them in Section VI. Section VII formulates a conditional version of
the theorem and it is proved in Section VIII. In Section IX we state the relation of this problem
to watermarking. Section X shows how to generalize to the case of non-finite sources with difference
distortion. Section XI shows how the results can be easily extended to stationary ergodic sources that
mix appropriately.

Because of space limitations, some of the details in the later sections are omitted. The full proofs can
be found in [3].

¹The bounded nature of the distortion function only becomes important if we are interested in end-to-end expected distortion. If all that
is desired is for the probability of excess distortion to be arbitrarily small, then no such assumptions are needed.

²Fundamentally, we are asking whether reliable lossless communication is necessarily the right primitive that defines layering in a
multipurpose communication system. Could lossy coding serve as an equally good primitive in principle?

³For simplicity of notation, the dependence of the attacker on block-length $n$ is suppressed. To be precise, Q should be interpreted as
a family of attackers indexed by $n$ such that the probability of excess distortion can be made as close to zero as desired by choosing an
attacker with an appropriately large $n$. This parallels the existence result for channel coding.

Because of space limitations, some of the details in the later sections are omitted. The full proofs can 
be found in [3]. 

II. Problem Formulation - Unconditional case 
We start with some notation: 

• $\mathcal{X} = \{1, 2, \ldots, |\mathcal{X}|\}$ → finite set. $\mathcal{X}^\infty$ is the input space.

• $\mathcal{Y} = \{1, 2, \ldots, |\mathcal{Y}|\}$ → finite set. $\mathcal{Y}^\infty$ is the output space.

• $p_X$ → probability distribution on $\mathcal{X}$.

• $X_1^\infty$ → iid sequence of random variables, each $X_i \sim p_X$.

• $d : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is a non-negative valued function. We should think of $d(i,j)$ as the distortion
between $i \in \mathcal{X}$, $j \in \mathcal{Y}$. The focus is on the average additive distortion on $n$-sequences, $d_n(x_1^n, y_1^n) =
\frac{1}{n} \sum_{t=1}^{n} d(x_t, y_t)$.

• The Attacker is a black box which takes in the input sequence $x_1^\infty \in \mathcal{X}^\infty$ and produces an output
$y_1^\infty \in \mathcal{Y}^\infty$. $y_1^\infty$ need not be a deterministic function of $x_1^\infty$; it can be randomized.

Note that the attacker is, in general, non-causal in the sense that it takes in the whole input sequence, 
looks at it, and produces an output sequence. The situation that the attacker looks at $x_1^\infty$ and produces
$y_1^\infty$ is the most general possible. In practice, the attacker will only look at finite length sequences
and produce an output; this is a special case of our definition. 

The attacker can also be viewed as a channel. We will use the words attacker and attack channel 
interchangeably. 

• $D$-distortion attacker → If the input to the attacker is the random variable sequence $X_1^\infty$ (defined
above: each $X_i$ iid $p_X$), the attacker produces the random variable sequence $Y_1^\infty$. This results in
a joint probability measure on $(X_1^\infty, Y_1^\infty)$. Under this probability measure, there should exist some
function $f(n)$ with $\lim_{n \to \infty} f(n) = 0$ so that:

$$\Pr\left( \frac{1}{n} \sum_{u=t}^{t+n-1} d(X_u, Y_u) > D \right) < f(n) \tag{2}$$

The above equation says that the average distortion caused to long sequences is bounded by $D$ with
high probability, and this probability → 1 at least as fast⁴ as $1 - f(n)$ with increasing block lengths
$n$, uniformly over the time $t$ at which this sliding block⁵ is taken (hence the name $D$-distortion attacker).
Note that on an individual symbol level, the attacker is essentially unconstrained: for any $X_t$, the
attacker can distort it really badly. It is only constrained over very long blocks.

• $p_X \pm \epsilon$ will denote the set of all probability measures $q_X$ on $\mathcal{X}$ such that $|q_X(i) - p_X(i)| < \epsilon\ \forall i \in \mathcal{X}$.

As we can see, the rate-distortion problem when the input sequence is iid $p_X$ is solved (in the sense of
[2]) by this attacker for distortion value $D$. The question we want to ask is, "Can we transmit bits reliably
over this attacker in the Shannon sense, and if yes, at what rates?"
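To make the $D$-distortion condition (2) concrete, the following sketch estimates the excess-distortion probability for one particular attacker: a memoryless channel that flips each bit independently, under Hamming distortion. The flip probability 0.2, the level $D = 0.25$ and all names are assumptions chosen for illustration; the definition itself allows far more general, non-causal attackers.

```python
import random

def excess_distortion_prob(n, flip_prob, D, trials, rng):
    """Monte Carlo estimate of Pr( (1/n) * sum_t d(X_t, Y_t) > D ) for a
    memoryless bit-flipping attacker under Hamming distortion."""
    exceed = 0
    for _ in range(trials):
        # Under Hamming distortion only the number of flipped positions matters.
        flips = sum(rng.random() < flip_prob for _ in range(n))
        if flips / n > D:
            exceed += 1
    return exceed / trials

rng = random.Random(1)
estimates = {n: excess_distortion_prob(n, 0.2, 0.25, 500, rng)
             for n in (20, 200, 2000)}
print(estimates)
```

The estimates shrink as $n$ grows, which is the behaviour that the function $f(n)$ in (2) is meant to capture.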

⁴No restrictions are made on how fast $f(n)$ tends to zero, just that we know how fast this probability goes to zero for this particular
family of attackers so that we can pick an appropriate block-length for the code.

⁵The purpose of the sliding block is merely to reduce notation in stating the condition. All theorems will be proved within a single block
of length $n$ that is sufficiently long on its own. This can be repeated with disjoint blocks if a stream of data needs to be transmitted.
III. Main results - unconditional case

Theorem 1: Assuming that there is common randomness available at the transmitter and the receiver,
all rates

$$R < R_X(D) = \inf_{\substack{X \sim p_X \\ E\,d(X,Y) \le D}} I(X;Y) \tag{3}$$

are achievable over a $D$-distortion attack channel, and in fact, this can be done by using iid $p_X$ random
codes.

The above theorem says that we can solve the Shannon communication problem over a $D$-distortion
attacker at all rates less than the rate distortion function, $R_X(D)$. We comment on the need for common
randomness in Section VI after we prove the above theorem.
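The infimum defining $R_X(D)$ in (3) rarely has a closed form. One standard numerical route, which is not part of this paper and is named here only for illustration, is the Blahut-Arimoto iteration. The sketch below assumes a finite alphabet, a distortion matrix `d[i][j]`, and a hypothetical slope parameter `beta` that selects one point on the curve; it is checked against the known formula $R(D) = 1 - h(D)$ for the equiprobable binary source under Hamming distortion.

```python
import math

def binary_entropy(p):
    """Binary entropy h(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def blahut_arimoto(p_x, d, beta, iters=500):
    """Return one (distortion, rate-in-bits) point on the rate-distortion
    curve of source p_x under distortion matrix d, for slope beta > 0."""
    nx, ny = len(p_x), len(d[0])
    q_y = [1.0 / ny] * ny
    for _ in range(iters):
        # Best test channel Q(j|i) for the current output marginal.
        Q = []
        for i in range(nx):
            w = [q_y[j] * math.exp(-beta * d[i][j]) for j in range(ny)]
            s = sum(w)
            Q.append([wj / s for wj in w])
        # Output marginal induced by that test channel.
        q_y = [sum(p_x[i] * Q[i][j] for i in range(nx)) for j in range(ny)]
    D = sum(p_x[i] * Q[i][j] * d[i][j] for i in range(nx) for j in range(ny))
    R = sum(p_x[i] * Q[i][j] * math.log2(Q[i][j] / q_y[j])
            for i in range(nx) for j in range(ny) if Q[i][j] > 0)
    return D, R

D, R = blahut_arimoto([0.5, 0.5], [[0, 1], [1, 0]], beta=2.0)
print(D, R, 1 - binary_entropy(D))
```

Sweeping `beta` traces out the whole curve; larger values correspond to smaller distortion and higher rate.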
We also have a converse theorem: 

Theorem 2: Rates larger than $R_X(D)$ can in general not be achieved over a $D$-distortion attacker.
After a few comments about this formulation in the next section, it is proved in the section after next.

IV. Connections to AVCs and Watermarking 

We can view the attacker as a non-causal arbitrarily varying channel (AVC). The AVC is constrained 
in such a way that it distorts most input sequences to an average distortion less than or equal to D where 
"most" is according to the iid px measure over the input sequences. The question that we are asking 
is, "What is the capacity of this AVC?" The foundational papers on AVCs are the papers by Blackwell, 
Breiman and Thomasian, [4], [5]. [4] considers the case when the channel is a fixed DMC coming from 
a particular set, but unknown. [5] considers the case when the channel can vary arbitrarily, but is a DMC 
at each time, and comes from a particular set based on past history unlike in our case where the attack 
channel at each time does not come from a particular set, nor is it causal. Stiglitz [6] has the same setup 
as [5], but calculates error exponents. Csiszar and Narayan [7] use a minimum distance decoding rule
similar to the one that we will use, but they do not consider AVCs in the form that we do.

To the extent that minimum distance is the relevant idea, this work can also be considered a
generalization of the original formulation of coding theory in [8] with the distortion measure generalizing the
Hamming distance. In addition, the composition of the codewords is specified in advance. Fundamentally,
Theorem 1 says that every rate-distortion problem is also associated with a coding theory problem.

This paper's formulation can also be viewed as a watermarking problem ([9]) with no covertext. The 
goal is to embed our data in the input to an attacker that acts within a distortion constraint. [10] by Somekh- 
Baruch and Merhav is the closest to our work. It allows for non-causal attackers and the definition of 
attacker is very similar to ours. But [10] does not use a minimum distortion decoding rule — they use 
another decoding rule which is superior in the sense that it achieves the best possible error exponent. We 
believe that proofs in [10], with slight modification, should be applicable in our scenario too, but we use 
a different decoding rule (a variant of minimum distance decoding) since it is arguably more natural and 
achieves capacity. The distinction between the two papers is more significant in the conditional case. 

V. Proofs - unconditional case 

We first prove Theorem 1 stated in Section III and show that by using $p_X$ random codes, we can
transmit reliably (in the Shannon sense) at all rates $R < R_X(D)$ over the $D$-distortion attack channel.

Codebook Construction: Generate $2^{nR}$ codewords iid $p_X$. This is the codebook, which we denote by
$\mathcal{C}$.

Decoding: Fix $\epsilon > 0$. Restrict attention to those codewords which are $p_X$-typical, that is, whose type
lies in $p_X \pm \epsilon$ (recall the definition of $p_X \pm \epsilon$ in Section II: all $q_X$ such that $|q_X(i) - p_X(i)| < \epsilon\ \forall i$).

Denote this restricted set of codewords by $\mathcal{C}_R$.
Fig. 1. The sorted received sequence $y_1^n$ and the correspondingly shuffled codeword, illustrating the relevant types.

Let $y_1^n$ denote the output of the attacker. If there is a unique $p_X$-typical $x_1^n$ in the codebook which is at
an average distortion less than or equal to $D$ from the output sequence, declare that $x_1^n$ was transmitted,
else declare error.

We call our decoding rule the "e-Nearest Typical Neighbor" decoding rule. The truly nearest neighbor 
decoding rule might be a bit more natural, but it is harder to analyze. 
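The decoding rule just described can be sketched directly. This is a minimal illustration under stated assumptions: codewords are tuples over a small alphabet, `p_x` is a dictionary of symbol probabilities, and the function names and the toy codebook are hypothetical.

```python
from collections import Counter

def is_typical(word, p_x, eps):
    """True if the empirical type of `word` is within eps of p_x on every symbol."""
    counts = Counter(word)
    n = len(word)
    return all(abs(counts.get(a, 0) / n - p) < eps for a, p in p_x.items())

def avg_distortion(x, y, d):
    """Average additive distortion between two equal-length sequences."""
    return sum(d(a, b) for a, b in zip(x, y)) / len(x)

def decode(codebook, y, p_x, d, D, eps):
    """Return the index of the unique typical codeword within average
    distortion D of y, or None (declare an error) otherwise."""
    hits = [k for k, c in enumerate(codebook)
            if is_typical(c, p_x, eps) and avg_distortion(c, y, d) <= D]
    return hits[0] if len(hits) == 1 else None

hamming = lambda a, b: int(a != b)
p_x = {0: 0.5, 1: 0.5}
codebook = [
    (0, 0, 1, 1, 0, 1, 0, 1),
    (1, 1, 0, 0, 1, 0, 1, 0),
    (0, 0, 0, 0, 0, 0, 0, 0),   # atypical, so never considered by the decoder
]
y = (0, 0, 1, 1, 0, 1, 1, 1)    # codeword 0 with one symbol flipped
decoded = decode(codebook, y, p_x, hamming, D=0.25, eps=0.1)
print(decoded)
```

Restricting to typical codewords before measuring distortion is what distinguishes this rule from plain nearest-neighbor decoding.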
In what follows, 

• $x_1^n$ denotes the transmitted codeword.

• $y_1^n$ denotes the received sequence (output of the attacker).

• $z_1^n$ denotes a $p_X$-typical codeword (that is, $z_1^n \in \mathcal{C}_R$) which is NOT transmitted.

The error event can be decomposed into 3 parts.

• $E_1$ → transmitted codeword atypical: $x_1^n \notin \mathcal{C}_R$.

• $E_2$ → Distortion caused by the attacker is not typical: $\frac{1}{n} \sum_{t=1}^{n} d(x_t, y_t) > D$.

• $E_3$ → a typical codeword which is not transmitted is at an average distortion less than or equal
to $D$ from the received sequence. Mathematically, $\exists\, z_1^n \in \mathcal{C}_R$ such that $z_1^n$ is not transmitted and
$\frac{1}{n} \sum_{t=1}^{n} d(z_t, y_t) \le D$.

Clearly, $\Pr(\text{error}) \le \Pr(E_1) + \Pr(E_2) + \Pr(E_3)$. By the weak law of large numbers, $\Pr(E_1) \to 0$ as
$n \to \infty$. $\Pr(E_2) \to 0$ as $n \to \infty$ follows by the definition of the $D$-distortion attacker (2). To upper bound
$\Pr(E_3)$, we do a type-based calculation [11] on the probability of error for a given received sequence $y_1^n$.

In what follows, it will be helpful to remember that q will always denote probability measures with 
observed types, whereas p will always denote probability measures with transmitted types. Recall that 
the received sequence is $y_1^n$. Let the type of $y_1^n$ be $q_Y$, that is, $\forall j \in \mathcal{Y}$, the number of $j$ occurring in
$y_1^n$ is $n q_Y(j)$.

Sort the output $y_1^n$ to place all the $j \in \mathcal{Y}$ together, and correspondingly shuffle the positions in the
codebook's codewords. This leads to no change in distortion between shuffled codewords and the sorted
received sequence $y_1^n$.

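Look at a generic shuffled codeword $z_1^n \in \mathcal{C}_R$ which is not transmitted. Over the chunk of length
$n q_Y(j)$, let the type of the corresponding entries of $z_1^n$ be $q_{X|Y=j}$. (See Figure 1.)

For the error event $E_3$,

1) $z_1^n$ is typical, that is,

$$\Big| \sum_{j \in \mathcal{Y}} q_Y(j)\, q_{X|Y}(i|j) - p_X(i) \Big| < \epsilon \quad \forall i \in \mathcal{X} \tag{4}$$

Denote $\sum_{j \in \mathcal{Y}} q_Y(j)\, q_{X|Y}(i|j)$ as $q_X(i)$. Thus,

$$q_X \in p_X \pm \epsilon \tag{5}$$

2) $z_1^n$ is at an average distortion $\le D$ from the received sequence, so

$$\sum_{i \in \mathcal{X}} \sum_{j \in \mathcal{Y}} q_Y(j)\, q_{X|Y}(i|j)\, d(i,j) \le D \tag{6}$$

Denote the distribution $q_Y(j)\, q_{X|Y}(i|j)$ on $\mathcal{X} \times \mathcal{Y}$ by $q_{XY}(i,j)$. Thus,

$$E_{q_{XY}}\, d(X,Y) \le D \tag{7}$$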
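The passage from sequences to types can be checked numerically. The sketch below, with hypothetical helper names and a toy pair of binary sequences, computes the joint type $q_{XY}$ of a codeword and a received sequence, and confirms that the average distortion depends on the pair only through $q_{XY}$, and that sorting by the received symbols leaves the joint type unchanged.

```python
from collections import Counter

def joint_type(z, y):
    """Empirical joint type q_XY(i, j) of the pair of sequences (z, y)."""
    n = len(y)
    return {pair: c / n for pair, c in Counter(zip(z, y)).items()}

def marginal_x(q_xy):
    """Marginal q_X(i) = sum_j q_XY(i, j)."""
    q_x = Counter()
    for (i, _), mass in q_xy.items():
        q_x[i] += mass
    return dict(q_x)

def expected_distortion(q_xy, d):
    """E_{q_XY} d(X, Y)."""
    return sum(mass * d(i, j) for (i, j), mass in q_xy.items())

hamming = lambda a, b: int(a != b)
z = (0, 1, 1, 0, 1, 0, 0, 1)
y = (0, 1, 0, 0, 1, 1, 0, 1)
q_xy = joint_type(z, y)
direct = sum(hamming(a, b) for a, b in zip(z, y)) / len(y)

# Sort positions by the received symbol and shuffle the codeword to match.
order = sorted(range(len(y)), key=lambda t: y[t])
z_sorted = tuple(z[t] for t in order)
y_sorted = tuple(y[t] for t in order)
print(direct, expected_distortion(q_xy, hamming), marginal_x(q_xy))
```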
Let us now bound the probability of event $E_3$.

First, the probability that over the chunk of length $n q_Y(j)$, the corresponding entries of $z_1^n$ have type
$q_{X|Y=j}$ (recall that $p_X$ is the generating distribution of codeword $z_1^n$) is at most

$$2^{-n q_Y(j) D(q_{X|Y=j} \| p_X)} \tag{8}$$

Thus, the probability that over the whole block of length $n$, in the chunks $n q_Y(j)$, the corresponding
entries of $z_1^n$ have type $q_{X|Y=j}$, for all $j$, is

$$\le \prod_{j \in \mathcal{Y}} 2^{-n q_Y(j) D(q_{X|Y=j} \| p_X)} \tag{9}$$

$$= 2^{-n \sum_{j \in \mathcal{Y}} q_Y(j) D(q_{X|Y=j} \| p_X)} \tag{10}$$

$$= 2^{-n D(q_{XY} \| p_X q_Y)} \tag{11}$$

It would be helpful to note the positions of where p occur and where q occur, in the above expression. 
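The step from the averaged divergence to the joint divergence in the exponent can be verified numerically. The distributions below are arbitrary illustrative values, not taken from the paper.

```python
import math

def kl(q, p):
    """D(q || p) in bits for distributions given as equal-length lists."""
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)

p_x = [0.5, 0.5]
q_y = [0.4, 0.6]
q_x_given_y = [[0.9, 0.1],   # q_{X|Y=0}
               [0.2, 0.8]]   # q_{X|Y=1}

# Averaged form: sum_j q_Y(j) * D(q_{X|Y=j} || p_X)
averaged = sum(q_y[j] * kl(q_x_given_y[j], p_x) for j in range(2))

# Joint form: D(q_XY || p_X q_Y), flattened over the pairs (i, j)
q_xy = [q_y[j] * q_x_given_y[j][i] for i in range(2) for j in range(2)]
p_x_q_y = [p_x[i] * q_y[j] for i in range(2) for j in range(2)]
joint = kl(q_xy, p_x_q_y)
print(averaged, joint)
```

The two numbers agree because the factor $q_Y(j)$ cancels inside the logarithm of the joint form.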

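To bound the probability that $z_1^n$ is at a distortion $\le D$ from $y_1^n$, we have to sum the above probability
over all possible types $q_{X|Y=j}$, $1 \le j \le |\mathcal{Y}|$, such that conditions 1 and 2 above (equivalently, (5) and (7))
are satisfied.

The number of $q_{X|Y=j}$ types is $\le (n+1)^{|\mathcal{X}|(|\mathcal{Y}|+1)}$. Also recall that the number of non-transmitted codewords $|\mathcal{C}_R| \le 2^{nR}$.

Putting all this together and using the union bound,

$$\Pr(E_3 \mid \text{type of } y_1^n \text{ is } q_Y) \le (n+1)^{|\mathcal{X}|(|\mathcal{Y}|+1)}\, 2^{nR}\, 2^{-n \inf_{q_{XY} \in S} D(q_{XY} \| p_X q_Y)} \tag{12}$$

where $S$ denotes the set of types satisfying conditions 1 and 2 (equivalently, (5) and (7)) and is

$$S = \left\{ q_{XY} : q_X \in p_X \pm \epsilon,\ E_{q_{XY}}\, d(X,Y) \le D,\ q_Y \text{ fixed} \right\} \tag{13}$$

Now, $q_Y$, the type of the received sequence, is arbitrary. Thus, an easy way to bound $\Pr(E_3)$ is to
just remove the "$q_Y$ fixed" condition from the above definition of $S$.
Thus finally,

$$\Pr(E_3) \le (n+1)^{|\mathcal{X}|(|\mathcal{Y}|+1)}\, 2^{nR}\, 2^{-n \inf_{q_{XY} \in T} D(q_{XY} \| p_X q_Y)} \tag{14}$$

where $T$ is the set

$$T = \left\{ q_{XY} : q_X \in p_X \pm \epsilon,\ E_{q_{XY}}\, d(X,Y) \le D \right\} \tag{15}$$

The only difference between the sets $S$ and $T$ is that the "$q_Y$ fixed" condition which exists in $S$ has been
removed in $T$.

Since $(n+1)^{|\mathcal{X}|(|\mathcal{Y}|+1)}$ is a polynomial in $n$, $\Pr(E_3) \to 0$ as $n \to \infty$ if

$$R < \inf_{\substack{q_X \in p_X \pm \epsilon \\ E_{q_{XY}} d(X,Y) \le D}} D(q_{XY} \| p_X q_Y) \tag{16}$$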
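A quick numerical look shows why the polynomial factor is harmless. The sketch assumes a bound of the form polynomial times $2^{nR}$ times $2^{-n \cdot \text{exponent}}$, with the polynomial degree $|\mathcal{X}|(|\mathcal{Y}|+1)$, and uses illustrative values for the rate and for the divergence exponent; the function name is hypothetical.

```python
import math

def log2_e3_bound(n, rate, exponent, size_x, size_y):
    """log2 of (n+1)^(|X|(|Y|+1)) * 2^(n*rate) * 2^(-n*exponent)."""
    poly = size_x * (size_y + 1) * math.log2(n + 1)
    return poly + n * (rate - exponent)

# Rate 0.4 bits against an exponent of 0.5 bits, binary alphabets.
values = [log2_e3_bound(n, 0.4, 0.5, 2, 2) for n in (10, 100, 1000, 10000)]
print(values)
```

For short blocks the bound exceeds 1 and says nothing, but once $n$ is large the linear term $n(R - \text{exponent})$ dominates the logarithmic one and the bound decays exponentially.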
Thus to prove Theorem 1 it suffices to prove that

$$\Theta_1 = \lim_{\epsilon \to 0}\ \inf_{\substack{q_X \in p_X \pm \epsilon \\ E_{q_{XY}} d(X,Y) \le D}} D(q_{XY} \| p_X q_Y) = R_X(D) = \inf_{\substack{X \sim p_X \\ E\,d(X,Y) \le D}} I(X;Y) = \inf_{\substack{p_X \text{ fixed} \\ p_Y \text{ can vary} \\ E_{p_{XY}} d(X,Y) \le D}} D(p_{XY} \| p_X p_Y) = \Theta_2 \tag{17}$$

The main difference between $\Theta_1$ and $\Theta_2$ (note the definitions of $\Theta_1$ and $\Theta_2$ in the above equation) is that:

• In $\Theta_1$, we have $D(q_{XY} \| p_X q_Y)$ with $q_X \in p_X \pm \epsilon$.

• In $\Theta_2$, we have $D(p_{XY} \| p_X p_Y)$.

It is clear that $\Theta_1$ has "more freedom" and hence, $\Theta_1 \le \Theta_2$.

All we need to prove is that $\Theta_1 \ge \Theta_2$.

This we do with a simple trick:

$$D(q_{XY} \| p_X q_Y) = D(q_X \| p_X) + D(q_{XY} \| q_X q_Y) \ge D(q_{XY} \| q_X q_Y) \tag{18}$$
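The decomposition in (18) follows by splitting the logarithm. A short derivation, with the summation indices $i \in \mathcal{X}$ and $j \in \mathcal{Y}$ written out, is sketched below; the inequality then uses only the non-negativity of $D(q_X \| p_X)$.

```latex
\begin{align*}
D(q_{XY} \| p_X q_Y)
  &= \sum_{i,j} q_{XY}(i,j) \log \frac{q_{XY}(i,j)}{p_X(i)\, q_Y(j)} \\
  &= \sum_{i,j} q_{XY}(i,j) \log \frac{q_{XY}(i,j)}{q_X(i)\, q_Y(j)}
   + \sum_{i,j} q_{XY}(i,j) \log \frac{q_X(i)}{p_X(i)} \\
  &= D(q_{XY} \| q_X q_Y) + \sum_{i} q_X(i) \log \frac{q_X(i)}{p_X(i)} \\
  &= D(q_{XY} \| q_X q_Y) + D(q_X \| p_X).
\end{align*}
```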

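Thus,

$$\Theta_1 \ge \lim_{\epsilon \to 0}\ \inf_{\substack{q_X \in p_X \pm \epsilon \\ E_{q_{XY}} d(X,Y) \le D}} D(q_{XY} \| q_X q_Y) \tag{19}$$

So we only need to prove that

$$\lim_{\epsilon \to 0}\ \inf_{\substack{q_X \in p_X \pm \epsilon \\ E_{q_{XY}} d(X,Y) \le D}} D(q_{XY} \| q_X q_Y) \ge \inf_{\substack{p_X \text{ fixed} \\ E_{p_{XY}} d(X,Y) \le D}} D(p_{XY} \| p_X p_Y) \tag{20}$$

This holds with equality, since the left-hand side is the limit of the rate-distortion functions of sources whose
distributions lie in $p_X \pm \epsilon$. It follows from the continuity of the rate distortion function $R_X(D)$ in $p_X$
and proves the direct theorem.

The sequence of choosing $n$, $\epsilon$ depending on the rate $R < R_X(D)$ and probability of error $p_e$ is:

1) Choose $\epsilon$ small enough so that $R < \inf_{X \in p_X \pm \epsilon} R_X(D)$.

2) Choose $n$ large enough so that the total probability of error from the events $E_1$, $E_2$ and $E_3$ adds
up to a value less than $p_e$.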

We now sketch the proof of the converse theorem, Theorem 2, that is, in general, we cannot transmit
at rates larger than $R_X(D)$ over a $D$-distortion attacker. Another way of stating this is that if one tries to
transmit at rates larger than $R_X(D)$, there is a $D$-distortion attacker such that we cannot transmit reliably
over this attacker.

First, consider the case that we are restricted to using iid $p_X$ random codes; we will remove this
restriction later.

Let the rate at which we want to transmit be $R = R_X(D - \alpha) > R_X(D)$ for some $\alpha > 0$.

We will show that there is a $D$-distortion attacker which is a DMC for which the error probability does not tend to 0.
Look at all DMCs that produce an average distortion of $(D - \alpha/2)$ between the input and output when
the input is $p_X$ distributed. The worst of these has capacity

$$C_{\text{worst}} = \inf_{\substack{X \sim p_X \\ E\,d(X,Y) \le D - \alpha/2}} I(X;Y) \tag{21}$$

But this value is precisely $R_X(D - \alpha/2)$. Also, any DMC that produces an average distortion of $(D - \alpha/2)$ is
a $D$-distortion attacker (follows from the weak law of large numbers). Thus, we have exhibited a DMC
which is a $D$-distortion attacker and over which we cannot transmit reliably at rates larger than $R_X(D - \alpha/2) <
R_X(D - \alpha) = R$.

To remove the assumption that we have to use $p_X$ random codes, consider the following attacker:
Fix $\epsilon > 0$. The attacker looks at inputs of length $n$ and if the input is not $p_X$-typical (that is, the
empirical type does not lie in $p_X \pm \epsilon$), the attacker will produce junk output, say the all-1 sequence,
whereas if the input sequence is $p_X$-typical, the attacker will act like the above DMC. The attacker needs
to keep increasing the length of sequences which it looks at and attacks, and correspondingly decrease
$\epsilon$. It is intuitively clear that if a codebook is chosen with a codeword which is not $p_X$-typical, the output
of the attacker will give no positive rate information about what was transmitted, and hence, the encoder
cannot use such codewords to transmit reliably at rates larger than $R_X(D)$.



VI. Comments on the proof 

If one compares the proofs of Shannon's channel coding theorem and the above, the two are quite similar
in the error calculation for the event $E_3$, but there is one difference. In Shannon's theorem, proving that
the average error probability over the ensemble of codes → 0 implies that there exists a codebook for
which the error probability → 0 for every single message. This is not immediately true in our case because
the attacker can use different strategies over different blocks.

Furthermore, if we were to use the same codebook over and over again, the input would no longer
look iid $p_X$ on very long sequences and the attacker would be free to just drive us to zero. Thus, the
codebook has to be generated at least somewhat independently in each block of length $n$. This is where
we use the assumption that there is common randomness available: using this common randomness,
the transmitter and the receiver can generate the codebook again and again, independently.

However, the code as given requires an exponentially large amount of common randomness. This can
easily be reduced to a polynomial (in the block-length $n$) amount of common randomness by using the
following tricks (details in [3]):

• Simulate in advance whether the input block will be $\epsilon$-typical or not. (This can use $O(\log n)$ bits.) If it is
atypical, just declare error no matter what message was sent.

• Make slight modifications to the proof to instead show the existence of deterministic codebooks
with input types like $p_X \pm \epsilon$ that can be list-decoded to some possibly large, but constant, list-size $l$
when facing a worst-case attacker inducing a distance $D$. This is done by patching the above proof
with arguments analogous to those for Theorem 5.1 in [12]. The additional trick is just noticing that
$I(X;Y) = H(Y) - H(Y|X)$ and that $2^{nH(Y)}$ is essentially the total number of output sequences of
type $q_Y$.⁶ When $l$ is large enough, $\frac{l}{l+1} H(Y) - H(Y|X)$ is as close as desired⁷ to $R_X(D)$.

• Once the deterministic codes are constant composition, a random permutation of the indices will make
each of them behave as though they were drawn from the original iid $p_X$ distribution conditioned on
the empirical type being typical. This takes $O(n \log n)$ commonly-random bits.

• By sending the message at a rate slightly less than the rate of the code, the message can be padded with
a randomly chosen hash of the true message. This takes at most another $O(n)$ commonly-random
bits and allows the decoder to uniquely disambiguate the decoded lists with high probability by just
rejecting messages whose hashes do not match up correctly.

⁶Rather than computing the probability of error, we are computing the expected number of $D$-balls that have at least $l + 1$ codewords in
them. For a given $l + 1$ codeword positions, this is just the existing probability of collision raised to the $l + 1$ power times the number of
possible $D$-balls. The total number of such combinations is also no more than $2^{nR(l+1)}$.

⁷And so the expected total number of collisions is as small as we want and so there exists at least one deterministic codebook that has
no such collisions at the $l$-list level.
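The last trick in the list, padding the message with a hash so that the decoder can prune a decoded list, can be sketched as follows. This is an illustration only: a keyed BLAKE2b digest from Python's standard library stands in for the "randomly chosen hash", the shared key plays the role of the common randomness, and the tag length and function names are hypothetical.

```python
import hashlib
import secrets

TAG_BYTES = 4  # hypothetical tag length; longer tags make false matches rarer

def tag(message: bytes, shared_key: bytes) -> bytes:
    """Keyed hash of the message; the key is the shared randomness."""
    return hashlib.blake2b(message, digest_size=TAG_BYTES, key=shared_key).digest()

def pad(message: bytes, shared_key: bytes) -> bytes:
    """Append the tag to the message before encoding."""
    return message + tag(message, shared_key)

def disambiguate(candidates, shared_key: bytes):
    """Keep list entries whose trailing tag matches; succeed only if exactly one survives."""
    ok = [c[:-TAG_BYTES] for c in candidates
          if tag(c[:-TAG_BYTES], shared_key) == c[-TAG_BYTES:]]
    return ok[0] if len(ok) == 1 else None

shared_key = secrets.token_bytes(16)
sent = pad(b"hello", shared_key)
# A decoded list containing the true padded message and two impostors.
decoded_list = [b"hellx" + sent[-TAG_BYTES:], sent, b"other" + b"\x00" * TAG_BYTES]
recovered = disambiguate(decoded_list, shared_key)
print(recovered)
```

Because the attacker does not know the key, an impostor in the list passes the check only with probability about $2^{-8 \cdot \text{TAG\_BYTES}}$.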

VII. Theorem - conditional Case 

Until now, we assumed that the input to the attacker should be a $p_X$-iid sequence. Now, consider the
case that the input is still an independently generated sequence but the distribution of $X_i$ depends on an
iid random variable sequence that is revealed non-causally to all parties.

We state some notation to add to the previous notation.

• $\mathcal{V} = \{1, 2, \ldots, |\mathcal{V}|\}$ is a finite set. A generic element of $\mathcal{V}$ will be denoted by $s$.

• $p_V$ → probability distribution on $\mathcal{V}$.

• $V_1^\infty$ → iid sequence of random variables generated according to $p_V$. In watermarking terms, this can be thought
of as the "coverstory." We will talk about relations to watermarking in Section IX.

• $p_{X|V=s}$ → If $V_i = s$, $X_i$ is generated according to the distribution $p_{X|V=s}$, but independently of other
$X_j$. The joint distribution on $(V_i, X_i)$ will be denoted by $p_{VX}$.

• Attacker → We assume that $V_1^\infty$ is known noncausally to the encoder, decoder and the attacker.

The next theorem is a conditional version of the inverse rate-distortion theorem, Theorem 1.

Theorem 3: Assuming that there is common randomness available at the transmitter and the receiver,
all rates

$$R < R_{X|V}(D) = \inf_{\substack{(V,X) \sim p_{VX} \\ E\,d(X,Y) \le D}} I(X;Y|V) \tag{22}$$

are achievable over a $D$-distortion attack channel, and in fact, this can be done by using iid $p_{X|V}$ random
codes.

We omit a converse theorem though the same arguments as above would give one. 

VIII. Proofs - conditional case 

The proof is very similar to the proof of the theorem in the unconditional case. Recall that $V_1^\infty$ is
known to the transmitter, receiver, and attacker.

Codebook Construction: Generate $2^{nR}$ codewords iid $p_{X|V}$. This is the codebook, which we denote
by $\mathcal{C}$.

Decoding: Fix $\epsilon > 0$. Restrict attention to those codewords $x_1^n$ such that $(v_1^n, x_1^n)$ is $p_{VX}$-typical, that
is, whose type lies in $p_{VX} \pm \epsilon$.

Denote this restricted set of codewords by $\mathcal{C}_R$.

Note that if $v_1^n$ is not typical, $\mathcal{C}_R$ will be empty. Thus:

• The definition of $\mathcal{C}_R$ implicitly assumes an error if $v_1^n$ is not strongly typical.

• $\mathcal{C}_R$ depends on $v_1^n$, that is, the codewords of $\mathcal{C}$ which lie in $\mathcal{C}_R$ are different for different $v_1^n$.

Let $y_1^n$ denote the output of the attacker. If there is a unique $x_1^n$ in the restricted codebook which is at
an average distortion less than or equal to $D$ from the output sequence, declare that $x_1^n$ was transmitted,
else declare error. We call this the "$\epsilon$-Nearest Conditionally Typical Neighbor" decoding rule.
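The conditional codebook and the joint typicality test can be sketched as below. The distributions, block length, codebook size and function names are hypothetical values chosen for illustration; in the scheme above the codebook would contain $2^{nR}$ codewords and be drawn from the common randomness.

```python
import random
from collections import Counter

def draw_codebook(v, p_x_given_v, num_codewords, rng):
    """Each codeword symbol at time t is drawn from p_{X|V=v_t}, independently."""
    def draw(s):
        symbols, weights = zip(*p_x_given_v[s].items())
        return rng.choices(symbols, weights=weights)[0]
    return [tuple(draw(s) for s in v) for _ in range(num_codewords)]

def jointly_typical(v, x, p_vx, eps):
    """True if the joint type of (v, x) is within eps of p_VX in every cell."""
    n = len(v)
    counts = Counter(zip(v, x))
    return all(abs(counts.get(cell, 0) / n - p) < eps for cell, p in p_vx.items())

p_v = {0: 0.5, 1: 0.5}
p_x_given_v = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_vx = {(s, i): p_v[s] * p_x_given_v[s][i] for s in p_v for i in (0, 1)}

rng = random.Random(7)
n = 4000
v = tuple(rng.choices([0, 1], k=n))          # coverstory, known to all parties
codebook = draw_codebook(v, p_x_given_v, 8, rng)
restricted = [c for c in codebook if jointly_typical(v, c, p_vx, 0.05)]
print(len(codebook), len(restricted))
```

Note that the restricted set depends on the realized coverstory: a different `v` changes which codewords pass the typicality test.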

In what follows, $z_1^n$ will denote a non-transmitted codeword as before. As in the unconditional case,
the error event consists of 3 parts:

• $E_1$ → $(v_1^n, x_1^n)$ is not typical. This is a slight modification of $E_1$ in the unconditional case.

• $E_2$ → Distortion caused by the attacker is not typical, that is, the transmitted codeword is at an average
distortion larger than $D$ from the received sequence. Mathematically, $\frac{1}{n} \sum_{t=1}^{n} d(x_t, y_t) > D$. This is
exactly the same as in the unconditional case.

• $E_3$ → a typical codeword which was not transmitted is at an average distortion less than or equal to
$D$ from the received sequence. This is exactly the same as in the unconditional case.

$\Pr(\text{error}) \le \Pr(E_1) + \Pr(E_2) + \Pr(E_3)$. $\Pr(E_1), \Pr(E_2) \to 0$ as in the unconditional case.

Fig. 2. The various types illustrated in the conditional rate-distortion case.

All we need to do is to upper bound $\Pr(E_3)$. As before, we do a method-of-types calculation on the
probability of possible $z_1^n$ that will cause an error for a given received sequence $y_1^n$.

The only essential difference between this proof and the proof of the unconditional case is that we
first do a sorting based on $V$ and then proceed exactly the same as before, that is, do a sorting based on
$Y$ and then do a sorting based on $X$.

Let the type of $v_1^n$ look like $q_V$. Sort, so that all $t$ such that $v_t = s$ are together. Over the subsequence
where $V_t = s$, let the type of the output produced by the attacker be $q_{Y|V=s}$. Again, do a sub-sorting such
that all $Y_t = j$ are together in each subsequence of $V_t = s$. In this $(V_t = s, Y_t = j)$ subsequence, let the
type of the subsequence of $z_1^n$ (recall that $z_1^n$ is a codeword which is NOT transmitted) look like $q_{X|Y=j,V=s}$.
See Figure 2.

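We now do the $\Pr(E_3)$ calculation.

First restrict attention to the subsequence $V_t = s$. Over this subsequence, we do exactly what we did
in the unconditional case. It follows from the proof of the unconditional case that the probability that
the $z_1^n$ subsequence type looks like $q_{X|V=s,Y=j}$ given that the $y_1^n$ subsequence type looks like $q_{Y|V=s}$ is

$$\le 2^{-n q_V(s) D(q_{XY|V=s} \| p_{X|V=s}\, q_{Y|V=s})} \tag{23}$$

The probability that over the whole sequence, the type is $q_{X|V,Y}$ given that the $Y$ type is $q_{Y|V}$ is

$$\le 2^{-n \sum_{s \in \mathcal{V}} q_V(s) D(q_{XY|V=s} \| p_{X|V=s}\, q_{Y|V=s})} \tag{24}$$

$$= 2^{-n D(q_{XY|V} \| p_{X|V}\, q_{Y|V} | q_V)} \tag{25}$$

There are a polynomial number of $q_{VXY}$ types, $\le (n+1)^{|\mathcal{V}||\mathcal{X}||\mathcal{Y}|}$, and by an argument similar to that in the
unconditional case,

$$\Pr(E_3) \le (n+1)^{|\mathcal{V}||\mathcal{X}||\mathcal{Y}|}\, 2^{nR}\, 2^{-n \inf_{q_{VXY} \in \mathcal{R}} D(q_{XY|V} \| p_{X|V}\, q_{Y|V} | q_V)} \tag{26}$$

where the set $\mathcal{R}$ over which the above infimum is taken is defined by:

1) $(v_1^n, z_1^n)$ is typical, that is, $q_{VX} \in p_{VX} \pm \epsilon$.

2) $z_1^n$ is at an average distortion $\le D$ from the received sequence $y_1^n$, that is, $E_{q_{XY}}\, d(X,Y) \le D$.

Thus,

$$\mathcal{R} = \left\{ q_{VXY} : q_{VX} \in p_{VX} \pm \epsilon,\ E_{q_{XY}}\, d(X,Y) \le D \right\} \tag{27}$$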
It follows that we only need to prove that

$$\lim_{\epsilon \to 0}\ \inf_{\substack{q_{VX} \in p_{VX} \pm \epsilon \\ E_{q_{XY}} d(X,Y) \le D}} D(q_{XY|V} \| p_{X|V}\, q_{Y|V} | q_V) = R_{X|V}(D) = \inf_{\substack{(V,X) \sim p_{VX} \\ E\,d(X,Y) \le D}} I(X;Y|V) = \inf_{\substack{p_{VX} \text{ fixed} \\ E_{p_{XY}} d(X,Y) \le D}} D(p_{XY|V} \| p_{X|V}\, p_{Y|V} | p_V) \tag{28}$$

The proof of this follows in almost the same way as in the unconditional case, just that we have to use
the continuity of $R_{X|V}(D)$ in $p_{VX}$ (in the unconditional case, we had used the continuity of $R_X(D)$ in
$p_X$).

This proves the conditional theorem, Theorem 3.

IX. Relation to Watermarking 

We can view this conditional problem as a watermarking problem with a coverstory⁸. In watermarking,
the user is allowed to make some tolerable level of distortion to the covertext. We have a restriction
of another kind, that is, if the coverstory entry is $s$, the input distribution should be $p_{X|V=s}$. Also, in
watermarking, the covertext is not known to the attacker.⁹ We assume that the covertext is known to the
attacker. If one looks at (38) in the paper of Somekh-Baruch and Merhav [10], this is the reason for the
Markov chain condition $U \to X \to Y$. We do not have the Markov chain condition $V \to X \to Y$
because the covertext $V$ is known to the attacker.

X. Continuous alphabets 

In this section, we consider the case when $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{V}$ are not necessarily finite discrete alphabets. We
divide the problem into 6 cases:

1) $\mathcal{X}$ finite, $\mathcal{Y}$ finite, $\mathcal{V}$ not there.

2) $\mathcal{X}$ finite, $\mathcal{Y}$ finite, $\mathcal{V}$ finite.

3) $\mathcal{X}$ non-finite, $\mathcal{Y}$ non-finite, $\mathcal{V}$ not there.

4) $\mathcal{X}$ non-finite, $\mathcal{Y}$ non-finite, $\mathcal{V}$ finite.

5) $\mathcal{X}$ finite, $\mathcal{Y}$ finite, $\mathcal{V}$ non-finite.

6) $\mathcal{X}$ non-finite, $\mathcal{Y}$ non-finite, $\mathcal{V}$ non-finite.

We will refer to these as Cases 1 through 6. Case 1 is the unconditional case covered in Theorem 1.
Case 2 is the conditional case covered in Theorem 3. We now go on to the rest. The proofs will be based
on quantization of the above sets and using ideas from the proofs of Theorems 1 and 3.

For Case 3, we need to prove that rates $< R_X(D)$ are achievable and for Cases 4, 5, 6, we need to prove
that rates $< R_{X|V}(D)$ are achievable.

Figure 3 is a dependency graph of which proofs depend on which.

⁸To distinguish it from the "covertext" in traditional watermarking.

⁹Since otherwise, presumably the attacker could just replace the input with the covertext itself. The same is not true if it is considered as
a coverstory.

Fig. 3. Dependence graph for the proofs of the various cases.



A. Compact support 

We first tackle Case 3, that is, $\mathcal{X}$, $\mathcal{Y}$ are non-finite sets, and there is no $\mathcal{V}$. We first assume that $\mathcal{X}$ and
$\mathcal{Y}$ are bounded subsets of $\mathbb{R}^\gamma$, for some positive integer $\gamma$. The case of unbounded support is addressed
later.

We first state some notation: 

• $\mathcal{X}, \mathcal{Y}$ → bounded subsets of $\mathbb{R}^\gamma$.

• $a$ → generic point in $\mathcal{X}$. We do not use $x$ because of potential confusion with the transmitted
sequence.

• $b$ → generic point in $\mathcal{Y}$. We do not use $y$ because of potential confusion with the received sequence.

• $d : \mathbb{R}^\gamma \times \mathbb{R}^\gamma \to \mathbb{R}$ is a difference distortion measure which is assumed to be uniformly continuous
with respect to the Euclidean metric.

• $D$-distortion attacker → Same as before. If the input to the attacker is the random variable sequence
$X_1^\infty$ ($X_i$ iid $p_X$), the attacker produces the random variable sequence $Y_1^\infty$. This results in a joint
probability measure on $(X_1^\infty, Y_1^\infty)$. Under this probability measure,

$$\Pr\left( \frac{1}{n} \sum_{u=t}^{t+n-1} d(X_u, Y_u) > D \right) \to 0 \text{ as } n \to \infty \tag{29}$$

• $\mathcal{X}_\Delta, \mathcal{Y}_\Delta$ → $\Delta$-hypercube grid quantization of $\mathcal{X}, \mathcal{Y}$ respectively. The boundary of the hypercube can
be put in any of the adjoining sets but not both. The quantization point is taken as the center of the
hypercube.

• $a_\Delta$ → Generic point of $\mathcal{X}_\Delta$. $a_\Delta \in \mathcal{X}_\Delta$ is obtained by quantizing $a \in \mathcal{X}$.

• $b_\Delta$ → Generic point of $\mathcal{Y}_\Delta$. $b_\Delta \in \mathcal{Y}_\Delta$ is obtained by quantizing $b \in \mathcal{Y}$.

• $p_{X_\Delta}$ → Probability distribution on $\mathcal{X}_\Delta$ obtained from the distribution $p_X$ on $\mathcal{X}$ in the obvious way.

Note that since the difference distortion function is uniformly continuous and $\mathcal{X}, \mathcal{Y}$ are bounded,
$\frac{1}{n} \sum_{t=1}^{n} d(x_{t\Delta}, y_{t\Delta}) \le \frac{1}{n} \sum_{t=1}^{n} d(x_t, y_t) + g(\Delta)\ \forall (x_1^n, y_1^n) \in \mathcal{X}^n \times \mathcal{Y}^n$, where $g(\Delta) \to 0$ as $\Delta \to 0$.
It follows that under the distribution governing $(X_{1\Delta}^\infty, Y_{1\Delta}^\infty)$ under the $D$-distortion attacker,

$$\Pr\left( \frac{1}{n} \sum_{u=t}^{t+n-1} d(X_{u\Delta}, Y_{u\Delta}) > D + g(\Delta) \right) \to 0 \text{ as } n \to \infty \tag{30}$$

If we work in the quantized world, this suggests what the decoding rule should be. 
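The effect of quantization on distortion can be illustrated in one dimension. The sketch below is an assumption-laden example rather than the paper's construction: it takes the alphabets to be the interval $[0, 1]$, uses squared error as the difference distortion, and uses a simple (not tight) slack function `g` derived for that case.

```python
import math
import random

def quantize(a, delta):
    """Map a real number to the center of its delta-wide grid cell."""
    return (math.floor(a / delta) + 0.5) * delta

def sq_err(a, b):
    """Squared-error difference distortion."""
    return (a - b) ** 2

def g(delta):
    """A valid (not tight) slack for squared error on [0, 1]: quantizing moves
    a - b by at most delta, and both differences have magnitude at most 1 + delta."""
    return delta * (2 + delta)

rng = random.Random(3)
n = 1000
x = [rng.random() for _ in range(n)]
y = [rng.random() for _ in range(n)]
raw = sum(sq_err(a, b) for a, b in zip(x, y)) / n

results = []
for delta in (0.2, 0.05, 0.01):
    quant = sum(sq_err(quantize(a, delta), quantize(b, delta))
                for a, b in zip(x, y)) / n
    results.append((delta, quant - raw))
print(results)
```

The gap between quantized and unquantized average distortion stays below $g(\Delta)$ and vanishes as the grid is refined, which is what allows the decoder to work with threshold $D + g(\Delta)$ on the quantized sequences.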

Codebook Construction: Generate $2^{nR}$ codewords iid $p_X$. This is the codebook $\mathcal{C}$. Let $\mathcal{C}_{\Delta}$ denote the
quantized codebook obtained by quantizing each codeword.

Decoding: Fix $\epsilon > 0$.

Restrict attention to those quantized codewords which are $p_{X_{\Delta}}$ $\epsilon$-typical. Denote this restricted set of
quantized codewords by $\mathcal{C}_R$.

Let $y_{\Delta}^n$ denote the quantized output of the attacker. If there is a unique $p_{X_{\Delta}}$-typical quantized codeword
$x_{\Delta}^n$ which is at an average distortion less than or equal to $D + g(\Delta)$ (note the change: $D + g(\Delta)$ instead
of $D$) from the output sequence, declare that $x^n$ was transmitted, else declare error. Mathematically, if

$\exists!\, x_{\Delta}^n \in \mathcal{C}_R$ such that $\frac{1}{n} \sum_{t=1}^{n} d(x_{t\Delta}, y_{t\Delta}) \leq D + g(\Delta)$, declare that $x^n$ was transmitted, else declare error.
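
The decoding rule can be sketched in a few lines of Python for a finite (already quantized) alphabet. This is only an illustration under assumptions: typicality is taken to mean that every empirical frequency is within $\epsilon$ of the reference distribution, and the names `empirical_type`, `is_typical`, `avg_distortion` and `decode` are hypothetical.

```python
from collections import Counter

def empirical_type(seq, alphabet):
    """Relative frequency of each alphabet symbol in seq."""
    counts = Counter(seq)
    n = len(seq)
    return {a: counts[a] / n for a in alphabet}

def is_typical(seq, p, eps):
    """Assumed notion of typicality: every frequency is within eps of p."""
    q = empirical_type(seq, p.keys())
    return all(abs(q[a] - p[a]) <= eps for a in p)

def avg_distortion(x, y, d):
    """Average per-letter distortion between two equal-length sequences."""
    return sum(d(a, b) for a, b in zip(x, y)) / len(x)

def decode(codebook, y, p, eps, d, threshold):
    """Index of the unique typical codeword within `threshold` of y, else None (error)."""
    hits = [m for m, x in enumerate(codebook)
            if is_typical(x, p, eps) and avg_distortion(x, y, d) <= threshold]
    return hits[0] if len(hits) == 1 else None

if __name__ == "__main__":
    p = {0: 0.5, 1: 0.5}
    hamming = lambda a, b: int(a != b)
    codebook = [(0, 1, 0, 1, 0, 1, 0, 1), (1, 1, 0, 0, 1, 1, 0, 0)]
    print(decode(codebook, (0, 1, 0, 1, 0, 1, 1, 1), p, 0.1, hamming, 0.25))
```

In the setting above, `threshold` would be $D + g(\Delta)$ and `p` would be $p_{X_{\Delta}}$; an output of `None` corresponds to declaring an error.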

This decoding rule has reduced the problem to Case 1 (finite $\mathcal{X}$ and $\mathcal{Y}$), and we can use results from
there. Thus, we can transmit at rates $R < R_{X_{\Delta}}(D + g(\Delta))$ using this decoding rule. It can be shown
using the appropriate continuity arguments that $\lim_{\Delta \to 0} R_{X_{\Delta}}(D + g(\Delta)) = R_X(D)$. This proves that we
can transmit at all rates $R < R_X(D)$.

The sequence in which $n$, $\Delta$, $\epsilon$ need to be chosen depending on the desired rate $R < R_X(D)$ and the
error probability $p_e$ is:

1) Choose $\Delta$ small enough so that $R < R_{X_{\Delta}}(D + g(\Delta))$.

2) Choose $\epsilon$ small enough so that $R < \inf_{X_{\Delta} \in p_{X_{\Delta}} \pm \epsilon} R_{X_{\Delta}}(D + g(\Delta))$.

3) Choose $n$ large enough so that the sum of error probabilities of events $E_1$, $E_2$, $E_3$ is less than $p_e$.

Case 4, where $\mathcal{X}, \mathcal{Y}$ are non-finite while the "coverstory" $\mathcal{V}$ is finite, is proved in exactly the same way,
by quantizing $\mathcal{X}, \mathcal{Y}$ finely enough.

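Next we consider Case 5, that is, $\mathcal{X}, \mathcal{Y}$ are finite and $\mathcal{V}$ is non-finite. We assume that $\mathcal{V}$ is a bounded
subset of $\mathbb{R}^{\eta}$ for some positive integer $\eta$.

We introduce some notation regarding $\mathcal{V}$.

• $\mathcal{V}$ → bounded subset of $\mathbb{R}^{\eta}$.

• $s$ → generic element of $\mathcal{V}$.

• $\mathcal{V}_{\Delta'}$ → $\Delta'$-hypercube quantization of $\mathcal{V}$. The boundary of the hypercube can be put in any of the
adjoining sets. The quantization point is taken as the center of the hypercube. We use $\Delta'$ instead of $\Delta$ because
we use $\Delta$ for quantizing $\mathcal{X}$ and $\mathcal{Y}$.

• $s_{\Delta'}$ → generic point of $\mathcal{V}_{\Delta'}$. $s_{\Delta'} \in \mathcal{V}_{\Delta'}$ is obtained by quantizing $s \in \mathcal{V}$.

• $S_{\Delta'}$ → quantization region (hypercube) of $\mathcal{V}$ containing the point $s_{\Delta'} \in \mathcal{V}_{\Delta'}$.

• $p_{V_{\Delta'}}$ → probability distribution on $\mathcal{V}_{\Delta'}$ obtained from $p_V$ on $\mathcal{V}$ in the obvious way.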

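What is not obvious, though, is how to define $p_{X|V_{\Delta'}}$. We need to make definitions in such a way that
we can do probability of error calculations for the event $E_3$ (the other two events, $E_1$ and $E_2$, will be
trivial as usual).

$$p^{\sup}_{X|V_{\Delta'}=s_{\Delta'}}(i \mid s_{\Delta'}) = \sup_{s \in S_{\Delta'}} p_{X|V}(i \mid s), \quad i \in \mathcal{X} \tag{31}$$

$$p^{\inf}_{X|V_{\Delta'}=s_{\Delta'}}(i \mid s_{\Delta'}) = \inf_{s \in S_{\Delta'}} p_{X|V}(i \mid s), \quad i \in \mathcal{X} \tag{32}$$

$p^{\sup}_{X|V_{\Delta'}=s_{\Delta'}}$ is not, in general, a probability measure. It is a measure with mass $\geq 1$ and denotes a
measure which "dominates" all probability measures $p_{X|V=s}$ over the quantization region of $\mathcal{V}$ which
contains $s_{\Delta'}$.

$p^{\inf}_{X|V_{\Delta'}=s_{\Delta'}}$ is not, in general, a probability measure. It is a measure with mass $\leq 1$. It denotes a measure
which "is dominated by" all probability measures $p_{X|V=s}$ over the quantization region of $\mathcal{V}$ which contains
$s_{\Delta'}$.

Intuitively, if we make some continuity assumptions on $p_{X|V=s}$ as $s \in \mathcal{V}$ varies, then $p^{\sup}_{X|V_{\Delta'}=s_{\Delta'}}$ and
$p^{\inf}_{X|V_{\Delta'}=s_{\Delta'}}$ are close to each other. For small enough $\Delta'$, all $s \in S_{\Delta'}$ are almost the same in the
distribution induced on $\mathcal{X}$.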
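
To make the dominating and dominated measures concrete, here is a small numerical sketch in Python. It assumes a hypothetical family in which $X$ given $V = s$ is Bernoulli($s$), and it approximates the supremum and infimum over one quantization cell by sampling the cell on a fine grid; the function names are illustrative only.

```python
def sup_inf_measures(cond_pmf, alphabet, cell_lo, cell_hi, samples=101):
    """Approximate the sup and inf measures over one cell [cell_lo, cell_hi] of V.

    The sup/inf over the cell are approximated on a finite grid, which is
    adequate for a continuous family (an assumption of this sketch).
    """
    grid = [cell_lo + (cell_hi - cell_lo) * k / (samples - 1) for k in range(samples)]
    p_sup = {i: max(cond_pmf(i, s) for s in grid) for i in alphabet}
    p_inf = {i: min(cond_pmf(i, s) for s in grid) for i in alphabet}
    return p_sup, p_inf

def normalize(measure):
    """Scale a finite positive measure so that its total mass is 1."""
    total = sum(measure.values())
    return {i: m / total for i, m in measure.items()}

def bernoulli_family(i, s):
    """Hypothetical example family: X | V = s is Bernoulli(s)."""
    return s if i == 1 else 1.0 - s

if __name__ == "__main__":
    for width in (0.1, 0.01):
        p_sup, p_inf = sup_inf_measures(bernoulli_family, (0, 1), 0.4, 0.4 + width)
        gap = max(p_sup[i] - p_inf[i] for i in (0, 1))
        print(width, sum(p_sup.values()), sum(p_inf.values()), gap)
```

For this family the mass of the sup measure exceeds 1 by the cell width and the mass of the inf measure falls short of 1 by the same amount, so both approach a probability measure as the cell shrinks.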

Another reason for defining $p^{\sup}_{X|V_{\Delta'}}$ is that it helps us to do error probability calculations. This is
demonstrated by the following lemma:

Lemma 1: Let $p_X$ be a probability distribution on $\mathcal{X}$. Let $\mu_X$ be a measure on $\mathcal{X}$ such that $\mu_X(i) \geq p_X(i)$
for all $i \in \mathcal{X}$ (that is, $\mu_X$ dominates $p_X$). Let $q_X$ be another probability distribution on $\mathcal{X}$.

Then, the probability that an $n$-length sequence generated iid $p_X$ has type $q_X$ satisfies

$$P_X^n(T(q_X)) \leq 2^{-n D(q_X \| \mu_X)} \tag{33}$$

where $D(q_X \| \mu_X)$ is defined in the obvious way, $D(q_X \| \mu_X) = \sum_{i \in \mathcal{X}} q_X(i) \log \frac{q_X(i)}{\mu_X(i)}$.






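Proof:

$$P_X^n(T(q_X)) \leq 2^{-n D(q_X \| p_X)} \quad \text{(by method of types)}$$

$$\leq 2^{-n D(q_X \| \mu_X)} \quad \text{(trivial by definition of } D(q_X \| \mu_X)\text{)}$$

The second step holds because $\mu_X(i) \geq p_X(i)$ for every $i$ gives $D(q_X \| \mu_X) \leq D(q_X \| p_X)$.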

This lemma gives us a way of upper bounding the error probability of a type class when we do not 
know the generating distribution, but have an upper bound on the same, and this is precisely the situation 
we are in. 
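
The chain of inequalities in Lemma 1 can be checked numerically on a small example. The sketch below computes the exact probability of a type class from the multinomial formula and compares it with both bounds; the specific numbers and the function names are assumptions chosen for illustration.

```python
import math

def type_class_prob(counts, p):
    """Exact probability that an iid-p sequence has the given symbol counts."""
    n = sum(counts)
    log_multinomial = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    log_prob = sum(c * math.log(pi) for c, pi in zip(counts, p) if c > 0)
    return math.exp(log_multinomial + log_prob)

def divergence_bits(q, mu):
    """D(q || mu) in bits; mu may be any positive measure, not necessarily normalized."""
    return sum(qi * math.log2(qi / mi) for qi, mi in zip(q, mu) if qi > 0)

if __name__ == "__main__":
    n, counts = 20, (15, 5)
    p = (0.5, 0.5)    # true generating distribution
    mu = (0.55, 0.5)  # a measure that dominates p pointwise
    q = tuple(c / n for c in counts)
    exact = type_class_prob(counts, p)
    bound_p = 2 ** (-n * divergence_bits(q, p))
    bound_mu = 2 ** (-n * divergence_bits(q, mu))
    print(exact, bound_p, bound_mu)
```

The bound with $\mu_X$ is looser than the one with $p_X$, but it can be evaluated without knowing $p_X$ exactly.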

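We define $p^{\mathrm{norm}}_{X|V_{\Delta'}}$ as the probability measure obtained by normalizing $p^{\sup}_{X|V_{\Delta'}}$.

If we have some continuity conditions (which we will make rigorous later) on $p_{X|V}$, then, as measures,
$p^{\sup}_{X|V_{\Delta'}=s_{\Delta'}}$, $p^{\inf}_{X|V_{\Delta'}=s_{\Delta'}}$, $p^{\mathrm{norm}}_{X|V_{\Delta'}=s_{\Delta'}}$ and $\{p_{X|V=s},\ s \in S_{\Delta'}\}$ will be quite close to each other.

Also, the distributions $p_{V_{\Delta'}}$ and $p^{\mathrm{norm}}_{X|V_{\Delta'}}$ result in a probability distribution on $(\mathcal{V}_{\Delta'}, \mathcal{X})$ which we
denote by $p^{\mathrm{norm}}_{V_{\Delta'},X}$.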

Next, we state the codebook formation and decoding rule: 

Codebook Construction: Generate $2^{nR}$ codewords iid $p_{X|V}$. This is the codebook $\mathcal{C}$.

Decoding: Fix $\epsilon > 0$. Restrict attention to those codewords $x^n$ for which $(x^n, v^n_{\Delta'})$ has an empirical type
$q_{X,V_{\Delta'}}$ that is $\epsilon$-typical. Denote this restricted set of codewords by $\mathcal{C}_R$.

Let $y^n$ denote the output of the attacker. If there is a unique $x^n$ in the restricted codebook $\mathcal{C}_R$ that is at
an average distortion less than or equal to $D$ from the output sequence, declare that $x^n$ was transmitted,
else declare error.

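We impose the following technical condition (see footnote) on $p^{\sup}_{X|V_{\Delta'}=s_{\Delta'}}$ and $p^{\inf}_{X|V_{\Delta'}=s_{\Delta'}}$, which captures
mathematically the closeness of $p_{X|V=s_1}$ and $p_{X|V=s_2}$ for $s_1$ and $s_2$ close.

Technical Condition: $\forall i \in \mathcal{X}$,

$$\lim_{\Delta' \to 0} \max_{s_{\Delta'} \in \mathcal{V}_{\Delta'}} \left| p^{\sup}_{X|V_{\Delta'}=s_{\Delta'}}(i \mid s_{\Delta'}) - p^{\inf}_{X|V_{\Delta'}=s_{\Delta'}}(i \mid s_{\Delta'}) \right| = 0 \tag{34}$$

This condition says that $p^{\sup}_{X|V_{\Delta'}=s_{\Delta'}}(i \mid s_{\Delta'}) - p^{\inf}_{X|V_{\Delta'}=s_{\Delta'}}(i \mid s_{\Delta'}) \to 0$ as $\Delta' \to 0$ uniformly over all partitions
of $\mathcal{V}$.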

We now do the probability of error calculations. 

It is easy to check that with the above decoding rule, the probability of error event $E_2 \to 0$ as $n \to \infty$.
For $E_1$, all that is required is for $n$ to be large enough while $\epsilon$ is also large enough relative to $\Delta'$ so that
$[p_{V_{\Delta'},X} - \epsilon,\ p_{V_{\Delta'},X} + \epsilon] \ni p^{\mathrm{norm}}_{V_{\Delta'},X}$. At that point, the weak law of large numbers is enough to guarantee
what is desired.

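For $\Pr(E_3)$, we follow the steps in the proof of Case 2 ($\mathcal{X}$ finite, $\mathcal{Y}$ finite, $\mathcal{V}$ finite) and
use Lemma 1 to replace $p_{X|V}$ with $p^{\sup}_{X|V_{\Delta'}}$. It follows that we can transmit at rates

$$R < \lim \; \inf_{\substack{q_{V_{\Delta'},X} \in p^{\mathrm{norm}}_{V_{\Delta'},X} \pm \epsilon \\ E_{q_{X,Y}} d(X,Y) \leq D}} D\left( q_{XY|V_{\Delta'}} \,\middle\|\, p^{\sup}_{X|V_{\Delta'}} q_{Y|V_{\Delta'}} \,\middle|\, q_{V_{\Delta'}} \right) \tag{35}$$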

The first thing that we need to take care of is $p^{\sup}_{X|V_{\Delta'}}$ appearing above: we want to somehow replace it by
$p^{\mathrm{norm}}_{X|V_{\Delta'}}$. Using the technical condition (34), it is easy to see that there is a function $h$ such that we can
transmit at rates

$$R < \lim \; \inf_{\substack{q_{V_{\Delta'},X} \in p^{\mathrm{norm}}_{V_{\Delta'},X} \pm \epsilon \\ E_{q_{X,Y}} d(X,Y) \leq D}} D\left( q_{XY|V_{\Delta'}} \,\middle\|\, p^{\mathrm{norm}}_{X|V_{\Delta'}} q_{Y|V_{\Delta'}} \,\middle|\, q_{V_{\Delta'}} \right) - h(\Delta') \tag{36}$$

where $h(\Delta') \to 0$ as $\Delta' \to 0$. The first term above is the same as that appearing in the proof of Case 2,
the conditional case with $\mathcal{X}, \mathcal{Y}, \mathcal{V}$ finite. It follows that we can transmit at all rates

$$R < \inf_{E d(X,Y) \leq D} I(X; Y \mid V_{\Delta'}) - h(\Delta') = R_{X|V_{\Delta'}}(D) - h(\Delta') \tag{37}$$

Now, $\lim_{\Delta' \to 0} R_{X|V_{\Delta'}}(D) - h(\Delta') = R_{X|V}(D)$ (we need to use the technical condition (34) for proving
this), and it follows that we can transmit at all rates less than $R_{X|V}(D)$.

Footnote: The technical condition can be shown to be satisfied for any joint distribution for $X, V$ that satisfies weak convergence in that $p(X \mid V = s_n) \to p(X \mid V = s)$
whenever $s_n \to s$.

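The sequence in which we choose $n$, $\epsilon$, $\Delta'$ depending on the rate $R$ and the probability of error $p_e$ is

1) Choose $\Delta'$ small enough so that $R < R_{X|V_{\Delta'}}(D) - h(\Delta')$.

2) Choose $\epsilon$ small enough such that $R < \inf_{(V_{\Delta'}, X) \in p^{\mathrm{norm}}_{V_{\Delta'},X} \pm \epsilon} R_{X|V_{\Delta'}}(D) - h(\Delta')$.

3) Choose $n$ large enough so that the sum of error probabilities of events $E_1$, $E_2$, $E_3$ is less than $p_e$.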

Finally, we consider Case 6, that of $\mathcal{X}, \mathcal{Y}, \mathcal{V}$ non-finite. This is just a mixture of the decoding rules for Case
4 ($\mathcal{X}$ non-finite, $\mathcal{Y}$ non-finite, $\mathcal{V}$ finite) and the previous case, Case 5 ($\mathcal{X}$ finite, $\mathcal{Y}$ finite, $\mathcal{V}$ non-finite).

First quantize $\mathcal{X}, \mathcal{Y}$ to size $\Delta$. This way, we get $p_{X_{\Delta}|V}$. This reduces the problem to the previous case
where $\mathcal{X}$ and $\mathcal{Y}$ are finite, and by combining the decoding rules of Case 4 and Case 5, it is easy to see
that we can transmit at all rates $R < R_{X_{\Delta}|V}(D + g(\Delta))$, where $g(\Delta)$ is defined analogously to that in Case
3.

Taking $\Delta \to 0$, it follows that we can transmit at all rates $R < R_{X|V}(D)$.

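Clearly, the technical condition in place of (34) in this case of $\mathcal{X}, \mathcal{Y}$ non-finite, but with bounded support,
is: $\forall x_{\Delta} \in \mathcal{X}_{\Delta}$,

$$\lim_{\Delta' \to 0} \max_{s_{\Delta'} \in \mathcal{V}_{\Delta'}} \left| p^{\sup}_{X_{\Delta}|V_{\Delta'}=s_{\Delta'}}(x_{\Delta} \mid s_{\Delta'}) - p^{\inf}_{X_{\Delta}|V_{\Delta'}=s_{\Delta'}}(x_{\Delta} \mid s_{\Delta'}) \right| = 0 \tag{38}$$

This is just saying that the technical condition of the finite $\mathcal{X}$ case should hold for all partitions of $\mathcal{X}$ in
this non-finite case.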

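The sequence in which we choose $\epsilon$, $n$, $\Delta$, $\Delta'$ to achieve a rate $R$ and probability of error $< p_e$ is

1) Choose $\Delta$ small enough so that $R < R_{X_{\Delta}|V}(D + g(\Delta))$.

2) Choose $\Delta'$ small enough so that $R < R_{X_{\Delta}|V_{\Delta'}}(D + g(\Delta)) - h(\Delta')$.

3) Choose $\epsilon$ small enough so that $R < \inf_{(V_{\Delta'}, X_{\Delta}) \in p^{\mathrm{norm}}_{V_{\Delta'},X_{\Delta}} \pm \epsilon} R_{X_{\Delta}|V_{\Delta'}}(D + g(\Delta)) - h(\Delta')$.

4) Choose $n$ large enough so that the error probabilities caused by events $E_1$, $E_2$, $E_3$ add up to
less than $p_e$.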

Next, we state (without proof) sufficient conditions for the technical conditions, Equations (34) and
(38), to hold.

1) Case 5, that is, $\mathcal{X}, \mathcal{Y}$ finite, $\mathcal{V}$ non-finite: The following weak convergence condition is sufficient
for the technical condition (34) to hold:

$$s_n \to s \ \Rightarrow\ p_{X|V=s_n} \to p_{X|V=s} \tag{39}$$

2) Case 6, that is, $\mathcal{X}, \mathcal{Y}, \mathcal{V}$ non-finite: what we want is that after discretizing $\mathcal{X}$ and $\mathcal{Y}$, the same
technical condition should hold. Assuming that $p_{X|V=s}$ have densities, the above condition,

$$s_n \to s \ \Rightarrow\ p_{X|V=s_n} \to p_{X|V=s} \tag{40}$$

is sufficient for the technical condition (38) to hold.

B. Unbounded support

The compact support condition is what allowed us to use quantization to reduce everything to the
finite-alphabet case, where the method of types could work since the number of possible types grew only
polynomially in the block-length $n$. Dealing with unbounded support requires an appropriate truncation argument. For
space reasons, we merely sketch the essential ideas here:

1) Pick a small $\delta > 0$.

2) Pick a sufficiently large compact region $\mathcal{X}_c \times \mathcal{V}_c$ (with the obvious modifications if there is no
coverstory) so that it satisfies the following properties:

• $P(\mathcal{X}_c \times \mathcal{V}_c) > 1 - \delta$

• $P(\mathcal{X}_c \mid V = s) > 1 - \delta$ for all $s \in \mathcal{V}_c$

• Let $X_c, V_c$ be the random variables $X, V$ conditioned on their values lying within the compact
region $\mathcal{X}_c \times \mathcal{V}_c$. Then $R_{X_c|V_c}(D) > (1 - \delta) R_{X|V}(D)$.

Given this, the distribution $P(X \mid V = s)$ can be written as a convex combination $(1 - \delta) P_{X_c|V_c=s} +
\delta P'_{X|V_c=s}$ for some other distribution $P'_{X|V_c=s}$.

3) Employ a two-part strategy for generating the random codebook. First, we classify positions in the
codebook as "clean" or "dirty" or "bad":

• Mark as "dirty" all positions where $V_t$ is not in $\mathcal{V}_c$.

• Flip a commonly random iid biased coin with probability $\delta$ of coming up heads for each position.
Mark as "bad" all positions where the coin turns up heads.

• All remaining positions are "clean."

Next, we generate the $2^{nR}$ random codewords iid using $P_{X_c|V_c}$ on the clean positions. For dirty
positions, we draw from $P_{X|V}$ while bad positions are drawn from $P'_{X|V_c}$. The resulting codewords
look as though they are drawn from $P_{X|V}$.

4) For decoding, look at only the clean positions. If their number is less than $(1 - 4\delta)n$, declare error.
Beyond that, we treat it as in the previous cases dealing with compact support, using the appropriate
quantization and nearest typical neighbor decoding.
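
The position-marking step and the decoder's clean-count test can be sketched as follows. This is a minimal illustration under assumptions: the coverstory is a scalar, the compact region is supplied as a predicate, a position that is both outside the region and hit by the coin is labeled "dirty", and the names `classify_positions` and `enough_clean` are hypothetical.

```python
import random

def classify_positions(v_seq, in_compact, delta, rng):
    """Label each position 'dirty', 'bad' or 'clean'.

    rng plays the role of the common randomness shared by encoder and decoder.
    """
    labels = []
    for v in v_seq:
        heads = rng.random() < delta  # the biased coin is flipped at every position
        if not in_compact(v):
            labels.append("dirty")
        elif heads:
            labels.append("bad")
        else:
            labels.append("clean")
    return labels

def enough_clean(labels, delta):
    """Decoder-side check: at least (1 - 4*delta) * n clean positions."""
    return labels.count("clean") >= (1 - 4 * delta) * len(labels)

if __name__ == "__main__":
    rng = random.Random(1)
    delta = 0.05
    v = [rng.gauss(0, 1) for _ in range(10000)]       # unbounded coverstory
    labels = classify_positions(v, lambda s: abs(s) <= 3.0, delta, rng)
    print({k: labels.count(k) for k in ("clean", "dirty", "bad")})
    print(enough_clean(labels, delta))
```

Both encoder and decoder can reproduce the same labels because they share the coverstory sequence and the coin flips.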

In terms of the probability of error, there is now a new error event $E_0$ which corresponds to there being
more than $4\delta n$ bad or dirty positions. By the weak law of large numbers (since bad and dirty positions
arrive no faster than a Bernoulli process with expected rate $2\delta$), this cannot happen very often and so
$P(E_0) \to 0$ as $n \to \infty$.

The other terms in the probability of error can be bounded by pretending that the attacker knows not
only the dirty positions, but also the bad ones. Assume it also knows that our decoding rule is going
to ignore all the dirty and bad positions. With this knowledge, the worst thing it can do is choose to
allocate no distortion to those positions and spend that distortion over the clean positions. However, this
only increases average distortion by a factor $\frac{1}{1-4\delta}$ over the clean positions that figure in the decoding
process. By choosing $\delta$ sufficiently small, we can be sure that $R < R\left(\frac{D}{1-4\delta}\right)$. Everything else proceeds
as before.
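
As a worked illustration of how small $\delta$ must be, consider the equiprobable binary source under Hamming distortion, for which $R(D) = 1 - h(D)$ with $h$ the binary entropy function. The sketch below assumes the decoder keeps at least a $(1 - 4\delta)$ fraction of positions, so that the distortion seen on the clean positions is at most $D/(1 - 4\delta)$; the target rate and the candidate values of $\delta$ are arbitrary choices for the example.

```python
import math

def binary_entropy(x):
    """Binary entropy h(x) in bits, with h(0) = h(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def rate_distortion_binary(D):
    """R(D) = 1 - h(D) for a fair binary source under Hamming distortion, 0 <= D <= 1/2."""
    return 1.0 - binary_entropy(D) if D < 0.5 else 0.0

def largest_delta(R, D, candidates):
    """Largest candidate delta with R < R(D / (1 - 4*delta)), or None if none works."""
    ok = [dl for dl in candidates
          if 4 * dl < 1 and R < rate_distortion_binary(D / (1 - 4 * dl))]
    return max(ok) if ok else None

if __name__ == "__main__":
    D, R = 0.25, 0.15
    print(rate_distortion_binary(D))
    for dl in (0.01, 0.02, 0.05):
        print(dl, rate_distortion_binary(D / (1 - 4 * dl)))
    print(largest_delta(R, D, (0.01, 0.02, 0.05)))
```

Any target rate strictly below $R(D)$ admits some admissible $\delta$, since $R(\cdot)$ is continuous and the inflated distortion tends to $D$ as $\delta \to 0$.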

XI. Stationary-Ergodic Sources 

So far, the information-embedding arguments seemed to depend strongly on the assumption of
memorylessness. This is what allowed the method of types to be used. To deal with more general sources with
memory, we can just apply a trick similar to the truncation argument in Section IX-B. Once again, in the
interest of space, we simply sketch the key ideas in the context of finite-alphabet rate-distortion problems.

Suppose that the source process $\{X_t\}$ is stationary (see footnote) and ergodic. In such cases, the rate-distortion
and conditional rate-distortion functions are defined in terms of limits of longer and longer finite-horizon
problems $X_1^t$. So, for any $t$ sufficiently long, $R < R_X(D)$ implies also that $tR < R_{X_1^t}(tD)$. But
before we simply pick a $t$ long enough, we need to impose a technical condition that requires the process
to "mix" appropriately uniformly fast towards its stationary distribution.

Footnote: Since time for us starts at 1, assume that the process has been initialized into its stationary distribution.

Assume that for every $\lambda > 0$, $\beta > 0$, there exists a uniform delay $d$ so that for all $t > 0$, all possible
values $x_1^t$, all $k > 0$, and all measurable subsets $A$ of $\mathcal{X}^k$:

$$P\left( X_{t+d}^{t+d+k-1} \in A \mid X_1^t = x_1^t \right) = (1 - \lambda) P_{\beta}(X_1^k \in A) + \lambda P'(A) \tag{41}$$

where $P'$ is a probability measure that can depend explicitly on $t$, $d$, $x_1^t$ while $P_{\beta}$ is a measure that does
not have any such dependence and is within $\pm\beta$ of the stationary probability distribution for the original
process.

Essentially, (41) just captures the idea that the process has fading memory and that if we wait long
enough, the process will return to its stationary distribution regardless of what values the process might
have taken in the past. It is easy to verify that (41) holds for all finite-state stationary ergodic Markov
chains (see footnote) as well as hidden Markov models with an underlying finite-state stationary ergodic Markov chain.
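
For a concrete feel of the mixing condition, the following sketch treats a hypothetical two-state Markov chain. It looks for the smallest $\lambda$ such that every row of the $d$-step transition matrix dominates $(1 - \lambda)$ times the stationary distribution; when this holds, the state after the delay is a mixture of a stationary draw (weight $1 - \lambda$) and some other distribution (weight $\lambda$), which is one way to realize the decomposition with $\beta = 0$. The transition matrix and the function names are assumptions made for illustration, and this is not claimed to be the verification the authors had in mind.

```python
def mat_mul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def mat_pow(P, d):
    """d-step transition matrix P^d by repeated multiplication."""
    n = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(d):
        result = mat_mul(result, P)
    return result

def mixing_lambda(P, pi, d):
    """Smallest lam with P^d[x][y] >= (1 - lam) * pi[y] for all states x, y."""
    Pd = mat_pow(P, d)
    c = min(Pd[x][y] / pi[y] for x in range(len(P)) for y in range(len(P)))
    return max(0.0, 1.0 - c)

def required_delay(P, pi, lam_target, d_max=1000):
    """Smallest delay d whose mixing_lambda is at most lam_target, or None."""
    for d in range(1, d_max + 1):
        if mixing_lambda(P, pi, d) <= lam_target:
            return d
    return None

if __name__ == "__main__":
    P = [[0.9, 0.1], [0.2, 0.8]]   # hypothetical ergodic chain
    pi = [2 / 3, 1 / 3]            # its stationary distribution
    for d in (1, 5, 10, 20):
        print(d, mixing_lambda(P, pi, d))
    print(required_delay(P, pi, 0.01))
```

For this chain the computed $\lambda$ decays geometrically in $d$ at the rate of the second eigenvalue, so any target $\lambda$ is met by a finite delay that does not depend on the past values.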

With this condition, the codebook construction proceeds in the following sequence: 

1) Pick small enough $\lambda$, $\beta$.

2) Based on the technical condition, calculate the required delay $d$ to make the process "forget" its
past.

3) Pick a $t$ sufficiently long so that $\frac{t}{t+d}$ is close to 1, and the finite-horizon rate-distortion function is
close to its infinite-horizon limit.

4) Segment time regularly with $t$ time units of potentially embedded data followed by $d$ time units of
dead-time.

5) Use common randomness to generate Bernoulli($\lambda$) random variables used to mark $t$-long slots as
being bad. This is done for the entire codebook, not on a codeword-by-codeword basis.

6) For the codewords, independently generate the $t$-long slots that are not bad by drawing from the
stationary distribution for $X_1^t$. Draw bad slots using $P'$ from (41) and the prefix of the codeword (see footnote)
so far.

7) Generate the $d$-length dead-time slots in between by sampling from the appropriate conditional
distribution once the following $t$-long slot has been chosen.
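
The time segmentation in steps 4 and 5 can be sketched as a labeling of time indices. The code below is an illustration only: it marks each index as belonging to a good slot, a bad slot, or dead-time, using a seeded generator as a stand-in for common randomness, and it reports the ratio of all positions to good positions, which is how much the average distortion could grow if all of it were concentrated on the good slots. The names `slot_schedule` and `inflation_factor` are hypothetical.

```python
import random

def slot_schedule(n, t, d, lam, rng):
    """Label time indices 0..n-1 as 'good', 'bad' or 'dead'.

    Time is cut into blocks of t data positions followed by d dead-time
    positions; each t-long slot is independently marked bad with probability lam.
    """
    labels = []
    while len(labels) < n:
        kind = "bad" if rng.random() < lam else "good"
        labels.extend([kind] * t)
        labels.extend(["dead"] * d)
    return labels[:n]

def inflation_factor(labels):
    """Total positions divided by good positions."""
    return len(labels) / labels.count("good")

if __name__ == "__main__":
    shared = random.Random(7)   # stand-in for common randomness
    labels = slot_schedule(n=12000, t=100, d=20, lam=0.05, rng=shared)
    print({k: labels.count(k) for k in ("good", "bad", "dead")})
    print(inflation_factor(labels))
```

Since the schedule is generated once for the whole codebook, the encoder and the decoder agree on which slots are good before any codeword is drawn.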

It is clear that every codeword is thus a simulation of the original process with memory. Conditioned
on knowing where the good slots of length $t$ are, the process is iid from both the encoder's and decoder's
point of view and so reverts to the previous case. The decoder can focus entirely on the good slots viewed
as an iid process. Once again, the probability of having fewer than a $(1 - 2\lambda)$ proportion of good slots
goes to zero. Decoding error can be bounded by supposing that the attacker knew which slots were good
and what time-segments were "dead-time." Thus, the attacker can choose to concentrate all its distortion
on the good slots. This increases the average distortion by a factor of at most $\frac{t+d}{t(1-2\lambda)}$, which is as
close to 1 as we want.